Overview

In this module, we will learn how to work with structural topic models using the stm package. Structural topic models are useful for gaining insight into the structure of discourse in a particular corpus. For instance, you can use an STM to ask questions like:

In this case, we’ll use STM models to investigate between-party differences in discourse related to President Bill Clinton’s impeachment. Ultimately, Clinton was impeached by the House of Representatives. However, there was a strong partisan split in the vote, such that the majority of Democrats voted against impeachment and the majority of Republicans voted for impeachment.

Accordingly, we’ll use an STM model to ask questions like:

Data Preparation

First, we’ll load the tidy-format version of our data.


dat_tt_words <- read.fst('../data/tdta_clean_house_data_tidy.fst')

Next, we’ll remove stop-words and lemmatize.


dat_tt_words.cln <- dat_tt_words %>%
  anti_join(stop_words) %>% # Drop stop words
  filter(word != 'num') %>% # Drop token that we used to represent numbers 
  mutate(word_lemma = textstem::lemmatize_words(word)) # lemmatize
Joining, by = "word"

Now, we’ll calculate word counts. Here, we’ll count by doc_num, Party, and date in order to preserve these variables; however, what we’re really interested in is the counts of each token for each document. Because all documents have only one value of Party and date, conditioning the count on these variables doesn’t change our calculations.



dat_tt_words.cln <- dat_tt_words.cln %>% 
  count(doc_num, Party, date, word_lemma)

  

Now, we’ll subset our data. Specifically, we will focus on documents generated between September, 1998 and January, 1999. For reference, the key moments in this data, with regard to Clinton’s impeachment hearings, were in October and December of 1999.

dat_tt_words.cln.samp <- dat_tt_words.cln %>%
  filter(date >= as_datetime('1998-09-01') & date <= as_datetime('1999-01-31'))

dat_tt_words.cln.samp %>%
  distinct(Party, doc_num) %>%
  count(Party)
NA

Subsetting on this date range yields about 8,800 documents with roughly even samples for Republicans and Democrats. However, there are only 42 documents associated with Independents. For simplicity, we’ll focus only on Democrats and Republicans.

dat_tt_words.cln.samp <- dat_tt_words.cln.samp %>%
  filter(Party != 'Independent')

Train/Text Split

Finally, for our exploratory analysis, we’ll take a 50% training sample from our subset data.

dat_tt_words.cln.samp.train <- dat_tt_words.cln.samp %>%
  filter(doc_num %!in% doc_ids_test) # Select documents for training
Error in x %in% y : object 'doc_ids_test' not found

Data formating for STM models

Pre-processing for STM models

To fit our STM models, we’ll use the fantastic stm package. While stm has it’s own functions for processing text data, we’ll try to do most of our processing with tidytext, which allows us to maintain consistenty with other use cases.

First, we need to cast our tidy format text data into a so-called sparse document-term matrix. We’ll also take some other pre-processing steps to simplify the modeling process. Specifically, we’ll drop very common and very uncommon words. This can dramatically minimize the parameter space of the model (remember, it has distributions over every word for each topic) and mitigate challenges posed by sparsity.


# Cast to sparse matrix, which is valid for textmineR
train_dtm <- dat_tt_words.cln.samp.train %>%
  cast_sparse(doc_num, word_lemma, value=n)

# User textminor function TermDocFreq to extract term and document frequencies
tf <- TermDocFreq(dtm = train_dtm) %>%
    mutate(doc_prop = doc_freq/n_docs) 


# Exclude words that (1) occur in less than 1% of documents or (2) occur in more than 99% of documents .
words_to_keep <- tf %>%
  filter(doc_prop >= .01 & doc_prop <= .99)

cat(paste('N words: ', nrow(tf), '\nN words after filtering: ', nrow(words_to_keep), sep=''))
N words: 17910
N words after filtering: 1839
# Drop these words from our training data

dat_tt_words.cln.samp.train <- dat_tt_words.cln.samp %>%
  filter(word_lemma %in% words_to_keep$term)

By dropping uncommon and common words, we’ve decreased our vocabular by an order of magnitude. In practice, it’s important to do sensitivity analyses over different thresholds; but for our purposes, we’ll assume that this transformation doesn’t dramatically change the meaning of our documents.

Now that we’ve filtered out infrequent/frequent words, we’ll recast our DTM. We’ll also create a design matrix containing our covariates.


# Cast training data into sparse DTM 
train_sparse <- dat_tt_words.cln.samp.train %>%
  cast_sparse(doc_num, word_lemma, value=n)


# Create a date range from the min/max dates in our training data
date_grid <- tibble(date = seq(min(dat_tt_words.cln.samp.train$date), 
                               max(dat_tt_words.cln.samp.train$date), by='days')) %>%
  mutate(date_int = row_number()) # Associate each date with an integer


# Create design matrix 
train_X <- dat_tt_words.cln.samp.train %>%
  distinct(doc_num, Party, date) %>%
  left_join(date_grid) # Merge our dates with the date grid so that we can represent date as a sequenc of integers
Joining, by = "date"

Working with STM models

Fitting STM models

Now we’re ready to fit our STM models! Because we don’t know the true number of topics, K, we’ll fit a model over a grid of K values ranging from 5 to 60 at intervals of 5. So, in total, we’ll train 12 topic models. This takes quite a while to run, even on a powerful machine. To help minimize training time, I’m using the furrr package to train the models in parallel. However, even in parallel, this code takes a while to run. If you don’t want to run it now, you can load the final object, many_models, from the /models/ directory in our workshop directory.

To fit our STM models, we’ll specify a value of K, the sparse matrix we want to train the model on, our model for topic prevalence, and the data.frame that contains our covariates. Here, we’ll model topic prevelance as a function of Party, a binary factor, and time, which we’ll model with a spline. In practice, you may want to try different functional forms, e.g. perhaps for time. In this data, we do not really have a continuous (or even approximately continuous) measure of time, so it might make more sense to treat day, or week, as a categorical variable.

## This takes quite a while to run!
## To save time, you can just load the `many_models` object, which is saved as `stm_models.RDS` in the /modeles directory.
## 


# Parallel model fitting adapted from https://juliasilge.com/blog/evaluating-stm/

# Uncomment and run to set number of cores to be used for parallel processing
# options(mc.cores = 6)

# Setup env for multiprocessing
plan(multisession, gc = TRUE)

many_models <- data_frame(K = seq(5,60,5)) %>% # Initialize a column of values for K 
  mutate(topic_model = future_map(K,   # map values of K into `stm` in parallel 
                                  ~stm(train_sparse,  # Sparse matrix
                                       K = .,         # placeholder for K
                                       prevalence = ~ Party*s(date_int), # prevalence model
                                       data = train_X, verbose=F)))

#saveRDS(many_models, '../models/stm_models.RDS')
many_models <- readRDS('../models/stm_models.RDS')

Evaluating STM models

Now that we’ve trained our models, we’ll try to pick a specific model based on various measure of model fit/quality. In this case, we’re trying to decide on the optimal number of Topics.

Note: In practice, unless there is a very clear winner, it’s probably a good idea to conduct sensitivity analyses over models with different numbers of topics.

First, we’ll extract a bunch of model fit metrics:


# Adapted from https://juliasilge.com/blog/evaluating-stm/

heldout <- make.heldout(train_sparse) # Here we're setting aside some heldout data that we'll use to evaluate our model. 
                                      # However, really, this data should NOT be taken from our training data!


k_result <- many_models %>%
  mutate(exclusivity = map(topic_model, exclusivity),
         semantic_coherence = map(topic_model, semanticCoherence, train_sparse),
         eval_heldout = map(topic_model, eval.heldout, heldout$missing),
         residual = map(topic_model, checkResiduals, train_sparse),
         bound =  map_dbl(topic_model, function(x) max(x$convergence$bound)),
         lfact = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
         lbound = bound + lfact,
         iterations = map_dbl(topic_model, function(x) length(x$convergence$bound)))

Now, let’s plot some of these metrics as a function of K:

k_result %>%
  transmute(K,
            `Lower bound` = lbound,
            Residuals = map_dbl(residual, "dispersion"),
            `Semantic coherence` = map_dbl(semantic_coherence, mean),
            `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %>%
  gather(Metric, Value, -K) %>%
  ggplot(aes(K, Value, color = Metric)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "K (number of topics)",
       y = NULL,
       title = "Model diagnostics by number of topics",
       subtitle = "These diagnostics indicate that a good number of topics would be around 50 or 60")

Ulimately, we want to pick a model that maximizes semantic coherence, roughly the likelihood that high probability words in a given topic co-occur in a high-probability document for that topic. However, at the same time, we want to minimize our residuals and maximize the held-out likelihood and the marginal probability of the data given the model, which is referred to, here, as the “lower bound”. Unfortunately, semantic coherence and these other metrics usually move in opposite directions.

In this case, based on our residuals, held-out likelihood, and lower bound plots, 50 <= K >= 60 is a good range. It also looks like these models are tied (at the lowest) levels of semantic coherence.

Coherence vs Exclusivity

Another important diagnostic is exclusivity, which represents the degree to which the highest probability words in a topic are exclusive or unique to that topic. This is a valuable complement to coherence, because you could maximize coherence by assigning the words with the highest marginal empirical probabilities to all topics (e.g. all topics place the most density on “the”, “and”, and “is”, for example). In this case, we could tell that this is a “bad” model by looking at the exclusivity scores for the model’s topics, which would be very low because all topics share their high probability words.


K_means <- k_result %>%
  select(K, exclusivity, semantic_coherence) %>%
  filter(K %in% c(40, 45, 50, 55, 60)) %>%
  unnest(cols = c(exclusivity, semantic_coherence)) %>%
  mutate(K = as.factor(K)) %>%
  group_by(K) %>%
  summarize_all(.funs = c(mean, sd))



k_result %>%
  select(K, exclusivity, semantic_coherence) %>%
  filter(K %in% c(40, 45, 50, 55, 60)) %>%
  unnest(cols = c(exclusivity, semantic_coherence)) %>%
  mutate(K = as.factor(K)) %>%
  ggplot(aes(semantic_coherence, exclusivity, color = K)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Comparing exclusivity and semantic coherence",
       subtitle = "Models with fewer topics have higher semantic coherence for more topics, but lower exclusivity") +
  #facet_wrap(K~.) +
  geom_hline(data=K_means, aes(yintercept=exclusivity_fn1, color=K)) + 
  geom_vline(data=K_means, aes(xintercept=semantic_coherence_fn1, color=K)) 

NA
NA

It looks like the model with 60 topics has the highest average exclusivity and the 3rd highest average coherence. However, this is a bit hard to see, so we can also look at the point estimates.

K_means %>%
  rename(mean_exclusivity = exclusivity_fn1,
         mean_semantic_coherence = semantic_coherence_fn1,
         sd_exclusivity = exclusivity_fn2,
         sd_semantic_coherence = semantic_coherence_fn2) %>%
  select(K, mean_exclusivity, sd_exclusivity, mean_semantic_coherence, sd_semantic_coherence) %>%
  mutate_if(is.numeric, round, digits=2)

Interestingly, it looks like exclusivity is the same for K = 50, 55, and 60, though the SD is a little lower for K = 55 and 60. In contrast, the model with K = 60 actually has the 4th lowest semantic coherence.

Choosing a model

Ultimately, our model residuals and marginal fit statistics indicate that the model with 60 topics is the best. However, comparisons semantic coherence and exclusivity suggest that other models could be just as good, depending on how you define model success.

When it’s hard to identify a clear winner, you should almost always conduct sensitivity analyses across multiple models! If they all lead you to the same conclusion, then perhaps that conclusion warrants greater trust. However, if they all lead you to different conclusions, then you probably shouldn’t trust any of them!

However, for our purposes, we’ll choose the model with K=60 and proceed with our analyses.

stm_model.1.train <- k_result %>% 
  filter(K == 60) %>% 
  pull(topic_model) %>% 
  .[[1]]

stm_model.1.train
A topic model with 60 topics, 8930 documents and a 1839 word dictionary.

Exploring STM models

Now that we’ve selected a topic model, we can begin to answer some of our questions.

What topics are relevant to our corpus?

To take a high-level glance at the topics estimated by our model, we can use the stm plot function with type='summary. This plot orders topics by the marginal proportion (e.g. likelihood of occurance) and shows the top words associated with each topic.

plot(stm_model.1.train, type = "summary", xlim = c(0, .3), text.cex=1.5)

By default, the top words are defined as the words with the highest probability. However, we can also change this so that the top words are the words with the highest exclusivity score.


plot(stm_model.1.train, type = "summary", xlim = c(0, .3), text.cex=1.5, n=5, labeltype='frex')

Inter-topic correlations

In contrast to vanilla LDA models, STM models estimate a covariance matrix for the distribution of topics. We can use this covariance matrix to visualize associations among topics.

library(igraph)
cormat <- topicCorr(stm_model.1.train)
set.seed(123)
plot(cormat, niter=5000, repulserad=60^4*10, 
     edge.arrow.size=0.5, 
     vertex.label.cex=0.75, 
     vertex.label.family="Helvetica",
     vertex.label.font=2,
     vertex.shape="circle", 
     vertex.size=2, 
     vertex.label.color="black", 
     edge.width=0.5)

This is hard to read, so let’s try a different kind of plot:


library(reshape2)

melted_cormat <- melt(cormat$cor)

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
geom_tile()

This is still hard to read! Because I’m primarily interested in associations with Topic 25, I’ll just plot the correlations with that topic.

library(plotly) 

p <- melted_cormat %>%
  filter(Var1 == 25 & Var2 != 25) %>%
  ggplot(aes(x = Var2, y = value)) + geom_text(aes(label=Var2)) + 
  ylab('Correlation') +
  xlab('Topic') +
  ggtitle('Correlations of Topic 25 with other topics')
  
ggplotly(p)

It looks like topic 25 is most strongly related to 35, 57, 53, so let’s take a closer look at those topics.

Evaluating topic content

Out of 60 topics, the one’s that seem most relevant to our questions about impeachment are 25, 35, 57, and 53. But, what do these topics mean? To get a better idea of their subjective meaning, we can look again at their top words.


plot(stm_model.1.train, type = "summary", xlim = c(0, .3), n=5, labeltype='prob', 
     topics = c(25, 35, 57, 53))

Examining relevant documents

Lookin at the top words associated with a topic is a good way to get an idea of what the topic represents. However, there is another crucial source of information: the documents most strongly associated with the topic. When trying to summarize topics, you should always look at the top words and the top documents.

We can do this using the findThoughts function, which prints the text of the documents most strongly associated with a particular topic. However, to do this, we first need to make our texts accessible. For simplicity, I’ll just examine the first 500 characters in each relevant document.


texts = dat_tt_words %>% # This tidy data.frame contains the original words (i.e. before we lemmatized)
  filter(doc_num %in% unique(dat_tt_words.cln.samp.train$doc_num)) %>% # Keep only the docs in our training data
  group_by(doc_num) %>% 
  summarize(text = str_sub(paste0(word, collapse=' '), 1,500)) %>% # Collapse the rows of words into a single cell
  ungroup()

findThoughts(stm_model.1.train, texts = texts$text, topics = c(25), n=3)

 Topic 25: 
     like most americans i believe the president's behavior was irresponsible inappropriate and deeply disappointing but like most americans i have concluded that his actions do not rise to standard of impeachment established by the framers of our constitution make no mistake the president is not above the law he can be sued in criminal or civil proceedings for his actions in this matter when he leaves office but as members of congress we have a unique responsibility and must adhere to the standards 
    equal justice means two things first every citizen including the least powerful like the plaintiff in the first civil case in which president clinton perjured himself has a right to demand truthful testimony under oath even when the defendant is the president secondly equal justice requires adherence to the rule of law by all americans including the most powerful further equal justice requires accountability by those who have committed perjury in this case accountability for perjury is provided 
    today we are faced with strong evidence that the president lied after swearing an oath to tell the truth we have only one legitimate remedy in front of us impeachment so with great remorse i will vote in favor of impeachment some people have said that impeaching the president is an extremist or radical position to those people i must ask is holding the president accountable for his actions extremist is expecting the president to tell the truth radical i submit to this house that the rational pos

findThoughts(stm_model.1.train, texts = texts$text, topics = c(35), n=3)

 Topic 35: 
     may i just intercede with a thought i have a couple of other members here who have been waiting they want to speak i would hope and i am sure that we all would agree that we perhaps could allow these members to speak but perhaps we could be brief and then conclude the day's business
    i was not planning to address the body at this time but a colleague just impugned the moderates who have decided not to vote their way as if we are somehow being pressured i would challenge anyone on this floor to name the moderates who have come to you and said we have been pressured i for one and my colleagues i have spoken to have said this is a vote of conscience and respect our vote of conscience as much as you are asking us to respect yours i think it is outrageous that my colleagues on th
    i yield myself such time as i may consume i simply want to say that i think this has been a good healthy discussion i appreciate the various points of view that have been presented we all clearly wants to clean up the environment that is not the issue here i commend the gentleman from california mr bilbray for coming forth i think it has been terrific that we have heard this debate because clearly it is a more complex issue than what initially meets the eye there are many facets to this discussi
findThoughts(stm_model.1.train, texts = texts$text, topics = c(57), n=3)

 Topic 57: 
     republicans in congress have a message to the president don't shut down the government republicans have been working with the administration since last spring to avoid a government shutdown i think we all agree it is not in the national interest to shut down the government but how tragic it would be if the president were to force such a shutdown to divert attention from other matters or to use it for political purposes as we head into the mid term elections republicans are willing to reach an ho
    republicans in congress have a message to the president do not shut the government down republicans have been working with this administration since last spring last spring to avoid a government shutdown this year i think we would agree that it is not in the national interests to shut down the government how tragic it would be if the president were to force a shutdown for political reasons republicans are willing to reach a compromise with the white house on our remaining differences just as we 
    we are now in the midst of another battle over the budget the president remains steadfast in his unwillingness to meet and try to find a way to work out a compromise so we can keep the government running the president expressed dismay that all num appropriation bills had not been passed by the congress and signed into law yet since num when the democrats controlled congress the congress failed entirely to pass all num appropriation bills num times that is right at least num times a democrat cong
findThoughts(stm_model.1.train, texts = texts$text, topics = c(53), n=3)

 Topic 53: 
     yes i would and i would like to add that the attorney general has appointed independent counsels for some of the periphery of this administration but whenever it gets close to the oval office or people close to the oval office there is a reluctance to go ahead and appoint an independent counsel instead of doing this piecemeal as has been the case by the justice department there should be one independent counsel to look at the whole campaign finance scandal the money that has come from all over t
    for over num years now despite overwhelming evidence the attorney general of the united states has refused to follow the law and the recommendations of her fbi director and the chief campaign finance prosecutor to appoint an independent counsel in the campaign finance scandal she has politicized the office over which she has control the justice department of the united states reports about disarray in this investigation at the justice department abound after num years of this investigation key p
    well the cia has finally admitted it and the new york times finally covered it the times ran the devastating story on saturday with the headline cia said to ignore charges of contra drug dealing in nums in a remarkable reversal by the new york times the paper reported that the cia knew about contra drug dealing and they covered it up the cia let it go on for years during the height of their campaign against the sandinista government among other revelations in the article were that the cia's insp
  • Topic 25:
  • Topic 35:
  • Topic 57:
  • Topic 53:

Hypothesis testing with STMs

Clearly, Topic 25 is the most strongly related to impeachment. So, let’s use the topic prevalence component of our model to estimate the distribution of Topic 25 over time and by party.

To do this, we’ll use stm’s estimateEffect function. Then, we’ll use tidystm’s function extract.estimateEffect to extract a tidy dataframe of the estimated effects.

Importantly, the estimateEffect function uses documents as units and the topic proportion as the outcome. Thus, our effects estimates reflect expected changes in the topic proportion for a given document, conditional on our covariates.

To get a better idea of what these effects imply, let’s visualize them.


label_dat <- data.frame(date = as.numeric(as_datetime('1998-10-08')), label='test', estimate=.2)
                        
effs %>%
  left_join(date_grid, by = c('covariate.value'='date_int')) %>%
  ggplot(aes(x = date, y = estimate)) + 
  geom_ribbon(aes(ymin=ci.lower, ymax=ci.upper, fill=moderator.value), alpha=.25) + 
  geom_line(aes(color=moderator.value)) +
  theme_apa() + 
  ylab('Topic Proportion') +
  xlab('Date') +
  geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2) + 
  geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2) +
  geom_label(aes(x = as_datetime('1998-10-08'), y=.32, label = "Impeachment Initiated")) +
  geom_label(aes(x = as_datetime('1998-12-19'), y=.32, label = "Impeachment Vote")) + 
  ggtitle('Estimated topic proportions by date and party') + 
  facet_wrap(topic~., ncol=1)

Now, let’s look specifically at the effects for the two days relevant to impeachment, the day the impeachment was initiated and the day it was voted on.

Investigating Topic Content

It seems clear that Democrats’ floor speeches were more relevant to Topic 25 than Republicans.

However, what if they speak about these topics in different ways? To investigate this hypothesis, we can estimate a new model that models topic content as a function of Party.

stm_model.2.train <- stm(train_sparse,
            K = 60,
            prevalence = ~ Party*s(date_int), # prevalence model
            content = ~ Party,
            data = train_X,
            verbose=F)

saveRDS(stm_model.2.train, '../models/stm_model_2_train.RDS')
stm_model.2.train <- readRDS('../models/stm_model_2_train.RDS')

Now, we’ll visualize between-party differences in topic content using the stm plot function with type='perspectives'.

plot(stm_model.2.train, type = "perspectives", topics = 25, n = 100)

In this figure, a word’s size indicates its association with the topic. Further, it’s position on the X-axis indicates it’s differential association with the specified levels of the covariate.

Validation with STM models

At this point, we’ve fit a bunch of STM models and picked one for further exploration. Based on what we observed, it seems that Democrats spoke more about impeachment on the day of the vote, but also that their discussions of impeachment focused more on “process”, whereas Republicans’ discussions of impeachment focused more explicitly on the president and words like “justice”, “truth”, and “lie”.

Given our understanding of the data, this makes sense! However, are these findings robust? To address this question, we will fit a new model on our held-out confirmation data. Ideally, we’d like to see the same topic structure and come to the same conclusions. However, if our conclusions deviate from our current expectations, we may need to accept the possibility that our conclusions are not reliable.

# Create design matrix 
test_X <- dat_tt_words.cln.samp.test %>%
  distinct(doc_num, Party, date) %>%
  left_join(date_grid) # Merge our dates with the date grid so that we can represent date as a sequenc of integers
Joining, by = "date"

stm_model.2.test <- stm(test_sparse,
            K = 60,
            prevalence = ~ Party*s(date_int), # prevalence model
            content = ~ Party,
            data = test_X,
            verbose=F)

First, let’s look at the top words for each topic.


plot(stm_model.2.test, type = "summary", xlim = c(0, .3), text.cex=1.5, n=5)

Uh oh, our nice “impeach” topic didn’t show up!

Extracting the beta matrix

One way to look for relevant topics is to identify topics that place the highest probability on relevant keywords, such as “impeachment”. To do this, we’ll extract the beta matrix from our model and identify the topics that place the highest probability on “impeachment”

Interesting, it looks like topic 50 has, by far, the greatest probability density over ‘impeachment’. Let’s look at this topic, along with 15, 46, and 35


plot(stm_model.2.test, type = "summary", xlim = c(0, .3), n=5, topics=c(50,15,46,35))

Okay, if you squint, Topic 50 could definitely be related to impeachment. Also, “Paula” in topic 46 is probably related to Paula Jones, the person who filed a sexual harassment lawsuit filed against Clinton.

Examining relevant documents

Let’s dig a bit deeper by looking at the relevant documents for each topic:


texts = dat_tt_words %>% # This tidy data.frame contains the original words (i.e. before we lemmatized)
  filter(doc_num %in% unique(dat_tt_words.cln.samp.test$doc_num)) %>% # Keep only the docs in our training data
  group_by(doc_num) %>% 
  summarize(text = str_sub(paste0(word, collapse=' '), 1,500)) %>% # Collapse the rows of words into a single cell
  ungroup()

findThoughts(stm_model.2.test, texts = texts$text, topics = c(50), n=3)

 Topic 50: 
     num years ago i walked into this chamber with awe as the son of an immigrant i was raised to believe in the majesty of our democracy and this is the citadel of that democracy today we are on the verge of weakening our democracy by abusing the most extraordinary tool our constitution affords us most constitutional scholars and most of the american people simply do not believe that the president's offenses as bad as they are rise to the level of impeachment yet we are about to set a dangerous prec
    like most americans i believe the president's behavior was irresponsible inappropriate and deeply disappointing but like most americans i have concluded that his actions do not rise to standard of impeachment established by the framers of our constitution make no mistake the president is not above the law he can be sued in criminal or civil proceedings for his actions in this matter when he leaves office but as members of congress we have a unique responsibility and must adhere to the standards 
    i have not to this point formally announced how i would vote on these four articles of impeachment in reaching my decision i have weighed not only my constitutional duty and this president's fate but i have weighed what vote is the right one for the country at this time i have concluded that this president can and should continue in office for the remainder of his elected term in making my decision i have looked carefully at the words of our framers particularly the founder of my hometown of pat

This topic certainly seems relevant to impeachment; these (very few!) documents also suggest that maybe this topic is associated with not supporting impeachment.


findThoughts(stm_model.2.test, texts = texts$text, topics = c(35), n=3)

 Topic 35: 
     our courts of law and our legal system are the bedrock of our democracy and of our system of individual rights lying under oath in a legal proceeding and obstruction of justice undermine the rights of all citizens who must rely upon our courts to protect their rights if lying under oath in our courts and obstruction are ignored or they are classified as merely minor offenses then we have jeopardized the rights of everyone who seeks redress in our courts lying under oath is an ancient crime of gr
    with a commitment to the principles of the rule of law which makes this country the beacon of hope throughout the world i cast my vote in favor of the four counts of impeachment of the conduct of the president of the united states as a representative in congress i can do no less in fulfilling my responsibility to the constitution and to all who have preceded me in defending the constitution from erosions of the rule of law each of the impeachment counts concerns the public conduct of the preside
    with a commitment to the principles of the rule of law which makes this country the beacon of hope throughout the world i cast my vote in favor of the resolution to undertake an impeachment inquiry of the conduct of the president of the united states as a representative in congress i can do no less in fulfilling my trust responsibility to the constitution and to all who have preceded me in defending the constitution from erosions of the rule of law the impeachment inquiry is necessary to determi

This topic also seems to be about impeachment; however, it seems that the top documents express positions for impeachment.

findThoughts(stm_model.2.test, texts = texts$text, topics = c(46), n=3)

 Topic 46: 
     today the united states house of representatives begins debate on articles of impeachment of president william jefferson clinton this is the first time in over num years that the house of representatives has performed its solemn duty of determining whether a sitting president should be impeached article ii section num of our constitution reads the president vice president and all civil officers of the united states shall be removed from office on impeachment for and conviction of treason bribery
    i rise in support of the articles of impeachment private sexual relations between consenting adults should be just that private if president clinton had simply been revealed to have had an extra marital affair the u.s house of representatives would not be considering articles of impeachment unfortunately the president's troubles arise from a number of actions quite different from private consensual sexual encounters before the president even knew monica lewinsky he was the defendant in a civil l
    i quote do you solemnly swear in the testimony you are about to give that it will be the truth the whole truth and nothing but the truth so help you god that is the oath president clinton took before his august numth testimony of this year the president answered i do and despite repeated attempts by deputy independent counsel sol wisenberg to warn him of the consequences of providing false or misleading testimony the president went on to make perjurious statements pertaining to his relationship 

This also appears to be about impeachment, though with greater focus on lying and proceedings.

findThoughts(stm_model.2.test, texts = texts$text, topics = c(15), n=3)

 Topic 15: 
     by direction of the house republican conference i call up a privileged resolution a href billnumth congresshouse resolutionnum h res num a and ask for its immediate consideration the clerk read the resolution as follows a href billnumth congresshouse resolutionnum h res num a resolved that the rules of the house of representatives of the one hundred fifth congress including applicable provisions of law or concurrent resolution that constituted rules of the house at the end of the one hundred fif
    most respectfully i thank you for recognizing me and permitting me to act expeditiously in a matter that i wish to bring to the attention of the house pursuant to rule ix i hereby give notice of my intention to offer a resolution as a question of the privilege of the house the form of my resolution is as follows and i shall try to be as expeditious as possible impeaching kenneth w starr an independent counsel of the united states appointed pursuant to num united states code section num b of high
    at the conclusion of this debate i will offer a motion to recommit the resolution offered by the gentleman from illinois to the committee on the judiciary with the instruction that the committee immediately report to the house the resolution in the form of our democratic alternative while we would have preferred that democrats have a normal opportunity to present our resolution as a amendment the procedure being used by the house today does not make a democratic amendment in order the motion to 

This appears to be less relevant to impeachment (as we have been thinking about it) and more relevant to proceedings.

  • 35:
  • 46:
  • 35:
  • 15:

Hypothesis testing on validation set

While these topics are different from what we observed in our training data, they still seem relevant. Let’s go ahead and visualize the effects of our covariates:

library(tidystm)

prep.test = estimateEffect(c(50, 46, 35) ~ Party*s(date_int), stm_model.2.test, meta=test_X)

  
effs.test <- purrr::map(c('Democratic', 'Republican'), # Levels of moderator
                   ~extract.estimateEffect(prep.test, # effects estimate object
                                           "date_int", # The IV we want to look at
                                           model = stm_model.2.test, # Our STM model
                                           moderator='Party', # Moderator 
                                           moderator.value = .)) %>% # Moderator levels, which we specify via `map`
  do.call('rbind', .) # Here, we rbind the mapped lists, which yields a single DF

label_dat <- data.frame(date = as.numeric(as_datetime('1998-10-08')), label='test', estimate=.2)
                        
effs.test %>%
  left_join(date_grid, by = c('covariate.value'='date_int')) %>%
  mutate(topic = recode(topic, `50` = 'Oppose impeachment', `46` = 'Lying', `35` = 'Support impeachment' )) %>%
  ggplot(aes(x = date, y = estimate)) + 
  geom_ribbon(aes(ymin=ci.lower, ymax=ci.upper, fill=moderator.value), alpha=.25) + 
  geom_line(aes(color=moderator.value)) +
  theme_apa() + 
  ylab('Topic Proportion') +
  xlab('Date') +
  geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2) + 
  geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2) +
  geom_label(aes(x = as_datetime('1998-10-08'), y=.32, label = "Impeachment Initiated")) +
  geom_label(aes(x = as_datetime('1998-12-19'), y=.32, label = "Impeachment Vote")) + 
  ggtitle('Estimated topic proportions by date and party') + 
  facet_wrap(topic~., ncol=1)

Investigating Topic Content on validation set

Finally, we can take a look at party differences in topic content. First, let’s look at Topic 50, the topic we’re thinking of as associated with opposition to impeachment.

plot(stm_model.2.test, type = "perspectives", topics = 50, n = 100)

This looks somewhat similar to the perspectives plot of Topic 25 in our training data. However, there are some key differences, too.

Now let’s look at Topic 35, the topic we’re thinking of as associated with supporting impeachment:

plot(stm_model.2.test, type = "perspectives", topics = 35, n = 100)

Interesting, it looks like, at least sometimes, Democrats might be talking about other issues (e.g. not Clinton’s impeachment) in the context of this topic.

Finally, let’s look at Topic 46, the topic that seems to be a little more generally about “lying”.

plot(stm_model.2.test, type = "perspectives", topics = 46, n = 100)

Here, we can see that Republican’s are more likely to mention Lewinsky and Jones, key people in the case against Clinton. Their use of this topic also places more density on words like “lie” and “perjury”.

Conclusions

---
title: "Theory Driven Text Analysis Workshop"
subtitle: "Topic Models\n\nSPSP 2020"
author: 
  name: "Joe Hoover & Brendan Kennedy"
  email: "joseph.hoover@kellogg.northwestern.edu\n\nbtkenned@usc.edu"
output:
  html_notebook:
    toc: yes
---


```{r, echo=F, message=F, warning=F}
# Define chunk options
knitr::opts_chunk$set(echo=T, message=F, warning=F)
```

```{r, message=F,  echo=F}
# Load packages 
library(pacman)
p_load(readr, dplyr, tidyr, ggplot2, pmr, jtools, knitr, reshape2, jsonlite, 
       lubridate, stringr, tidytext, fst, textstem, tm, quanteda, topicmodels, textmineR,
       stm, furrr)

# If error on install topic models, this post might help:
# https://stackoverflow.com/questions/25759007/error-installing-topicmodels-package-non-zero-exit-status-ubuntu

'%!in%' <- function(x,y)!('%in%'(x,y))

```

# Overview 

In this module, we will learn how to work with structural topic models using the `stm` package. Structural topic models are useful for gaining insight into the structure of discourse in a particular corpus. For instance, you can use an STM to ask questions like:

* What topics are relevant in a corpus
* Does the distribution of topics change depending on key covariates (e.g. experimental condition, the political affiliation of the speaker, etc)
* Does the content of a topic (i.e. it's distribution over words) change depending on key covariates

In this case, we'll use STM models to investigate between-party differences in discourse related to President Bill Clinton's impeachment. Ultimately, Clinton was impeached by the House of Representatives. However, there was a strong partisan split in the vote, such that the majority of Democrats voted *against* impeachment and the majority of Republicans voted *for* impeachment. 

Accordingly, we'll use an STM model to ask questions like:

* When did documents (i.e. speeches) most relevant to impeachment occur?
* Which party produced more documents related to impeachment?
* What, if any, between party differences are there in discourse about impeachment?


# Data Preparation

First, we'll load the tidy-format version of our data.
```{r}

dat_tt_words <- readRDS('../data/tdta_clean_house_data_tidy.RDS')

```


Next, we'll remove stop-words and lemmatize.

```{r}

dat_tt_words.cln <- dat_tt_words %>%
  anti_join(stop_words) %>% # Drop stop words
  filter(word != 'num') %>% # Drop token that we used to represent numbers 
  mutate(word_lemma = textstem::lemmatize_words(word)) # lemmatize

```


Now, we'll calculate word counts. Here, we'll count by doc_num, Party, and date in order to preserve these variables; however, what we're really interested in is the counts of each token for each document. Because all documents have only one value of Party and date, conditioning the count on these variables doesn't change our calculations. 

```{r}


dat_tt_words.cln <- dat_tt_words.cln %>% 
  count(doc_num, Party, date, word_lemma)

  
```

Now, we'll subset our data. Specifically, we will focus on documents generated between September, 1998 and January, 1999. For reference, the key moments in this data, with regard to Clinton's impeachment hearings, were in October and December of 1999.


```{r}
dat_tt_words.cln.samp <- dat_tt_words.cln %>%
  filter(date >= as_datetime('1998-09-01') & date <= as_datetime('1999-01-31'))

dat_tt_words.cln.samp %>%
  distinct(Party, doc_num) %>%
  count(Party)

```

Subsetting on this date range yields about 8,800 documents with roughly even samples for Republicans and Democrats. However, there are only 42 documents associated with Independents. For simplicity, we'll focus only on Democrats and Republicans. 


```{r}
dat_tt_words.cln.samp <- dat_tt_words.cln.samp %>%
  filter(Party != 'Independent')

```


## Train/Text Split 

Finally, for our exploratory analysis, we'll take a 50% training sample from our subset data.


```{r}
doc_ids <- unique(dat_tt_words.cln.samp$doc_num) # Get document IDs

n_docs = .50 * length(doc_ids) # Calculate number of documents to sample

set.seed(1231) # set seed for reproducibility

doc_ids_test = sample(doc_ids, n_docs) # sample document IDs for test data

dat_tt_words.cln.samp.train <- dat_tt_words.cln.samp %>%
  filter(doc_num %!in% doc_ids_test) # Select documents for training

dat_tt_words.cln.samp.test <- dat_tt_words.cln.samp %>%
  filter(doc_num %in% doc_ids_test) # Select documents for test


```


## Data formating for STM models

### Pre-processing for STM models

To fit our STM models, we'll use the fantastic `stm` package. While `stm` has it's own functions for processing text data, we'll try to do most of our processing with tidytext, which allows us to maintain consistenty with other use cases. 

First, we need to cast our tidy format text data into a so-called `sparse` document-term matrix. We'll also take some other pre-processing steps to simplify the modeling process. Specifically, we'll drop very common and very uncommon words. This can dramatically minimize the parameter space of the model (remember, it has distributions over *every* word for *each* topic) and mitigate challenges posed by sparsity.

```{r}

# Cast to sparse matrix, which is valid for textmineR
train_dtm <- dat_tt_words.cln.samp.train %>%
  cast_sparse(doc_num, word_lemma, value=n) 

# User textminor function TermDocFreq to extract term and document frequencies

tf <- TermDocFreq(dtm = train_dtm) %>%
    mutate(doc_prop = doc_freq/n_docs) 


# Exclude words that (1) occur in less than 1% of documents or (2) occur in more than 99% of documents .
words_to_keep <- tf %>%
  filter(doc_prop >= .01 & doc_prop <= .99)

cat(paste('N words: ', nrow(tf), '\nN words after filtering: ', nrow(words_to_keep), sep=''))

# Drop these words from our training data

dat_tt_words.cln.samp.train <- dat_tt_words.cln.samp %>%
  filter(word_lemma %in% words_to_keep$term)

```

By dropping uncommon and common words, we've decreased our vocabular by an order of magnitude. In practice, it's important to do sensitivity analyses over different thresholds; but for our purposes, we'll assume that this transformation doesn't dramatically change the meaning of our documents. 

Now that we've filtered out infrequent/frequent words, we'll recast our DTM. We'll also create a design matrix containing our covariates.

```{r}

# Cast training data into sparse DTM 
train_sparse <- dat_tt_words.cln.samp.train %>%
  cast_sparse(doc_num, word_lemma, value=n)


# Create a date range from the min/max dates in our training data
date_grid <- tibble(date = seq(min(dat_tt_words.cln.samp.train$date), 
                               max(dat_tt_words.cln.samp.train$date), by='days')) %>%
  mutate(date_int = row_number()) # Associate each date with an integer


# Create design matrix 
train_X <- dat_tt_words.cln.samp.train %>%
  distinct(doc_num, Party, date) %>%
  left_join(date_grid) # Merge our dates with the date grid so that we can represent date as a sequenc of integers

```

# Working with STM models

## Fitting STM models

Now we're ready to fit our STM models! Because we don't know the true number of topics, K, we'll fit a model over a grid of K values ranging from 5 to 60 at intervals of 5. So, in total, we'll train 12 topic models. This takes quite a while to run, even on a powerful machine. To help minimize training time, I'm using the `furrr` package to train the models in parallel. However, even in parallel, this code takes a while to run. If you don't want to run it now, you can load the final object, `many_models`, from the `/models/` directory in our workshop directory. 

To fit our STM models, we'll specify a value of `K`, the sparse matrix we want to train the model on, our model for topic prevalence, and the data.frame that contains our covariates. Here, we'll model topic prevelance as a function of Party, a binary factor, and time, which we'll model with a spline. In practice, you may want to try different functional forms, e.g. perhaps for time. In this data, we do not really have a continuous (or even approximately continuous) measure of time, so it might make more sense to treat day, or week, as a categorical variable. 

```{r, eval=F}
## This takes quite a while to run!
## To save time, you can just load the `many_models` object, which is saved as `stm_models.RDS` in the /modeles directory.
## 


# Parallel model fitting adapted from https://juliasilge.com/blog/evaluating-stm/

# Uncomment and run to set number of cores to be used for parallel processing
# options(mc.cores = 6)

# Setup env for multiprocessing
plan(multisession, gc = TRUE)

many_models <- data_frame(K = seq(5,60,5)) %>% # Initialize a column of values for K 
  mutate(topic_model = future_map(K,   # map values of K into `stm` in parallel 
                                  ~stm(train_sparse,  # Sparse matrix
                                       K = .,         # placeholder for K
                                       prevalence = ~ Party*s(date_int), # prevalence model
                                       data = train_X, verbose=F)))

#saveRDS(many_models, '../models/stm_models.RDS')

```


```{r}
# Load the trained models
many_models <- readRDS('../models/stm_models.RDS')
```

## Evaluating STM models

Now that we've trained our models, we'll try to pick a specific model based on various measure of model fit/quality. In this case, we're trying to decide on the optimal number of Topics. 

**Note:** In practice, unless there is a very clear winner, it's probably a good idea to conduct sensitivity analyses over models with different numbers of topics. 

First, we'll extract a bunch of model fit metrics:

```{r}

# Adapted from https://juliasilge.com/blog/evaluating-stm/

heldout <- make.heldout(train_sparse) # Here we're setting aside some heldout data that we'll use to evaluate our model. 
                                      # However, really, this data should NOT be taken from our training data!


k_result <- many_models %>%
  mutate(exclusivity = map(topic_model, exclusivity),
         semantic_coherence = map(topic_model, semanticCoherence, train_sparse),
         eval_heldout = map(topic_model, eval.heldout, heldout$missing),
         residual = map(topic_model, checkResiduals, train_sparse),
         bound =  map_dbl(topic_model, function(x) max(x$convergence$bound)),
         lfact = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
         lbound = bound + lfact,
         iterations = map_dbl(topic_model, function(x) length(x$convergence$bound)))

```

Now, let's plot some of these metrics as a function of K:

```{r}
k_result %>%
  transmute(K,
            `Lower bound` = lbound,
            Residuals = map_dbl(residual, "dispersion"),
            `Semantic coherence` = map_dbl(semantic_coherence, mean),
            `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %>%
  gather(Metric, Value, -K) %>%
  ggplot(aes(K, Value, color = Metric)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "K (number of topics)",
       y = NULL,
       title = "Model diagnostics by number of topics",
       subtitle = "These diagnostics indicate that a good number of topics would be around 50 or 60")
```

Ulimately, we want to pick a model that maximizes semantic coherence, roughly the likelihood that high probability words in a given topic co-occur in a high-probability document for that topic. However, at the same time, we want to minimize our residuals and maximize the held-out likelihood and the marginal probability of the data given the model, which is referred to, here, as the "lower bound". Unfortunately, semantic coherence and these other metrics usually move in opposite directions. 

In this case, based on our residuals, held-out likelihood, and lower bound plots, 50 <= K >= 60 is a good range. It also looks like these models are tied (at the lowest) levels of semantic coherence. 


### Coherence vs Exclusivity


Another important diagnostic is *exclusivity*, which represents the degree to which the highest probability words in a topic are *exclusive* or *unique* to that topic. This is a valuable complement to coherence, because you could maximize coherence by assigning the words with the highest marginal empirical probabilities to all topics (e.g. all topics place the most density on "the", "and", and "is", for example). In this case, we could tell that this is a "bad" model by looking at the exclusivity scores for the model's topics, which would be very low because all topics share their high probability words. 


```{r}

K_means <- k_result %>%
  select(K, exclusivity, semantic_coherence) %>%
  filter(K %in% c(40, 45, 50, 55, 60)) %>%
  unnest(cols = c(exclusivity, semantic_coherence)) %>%
  mutate(K = as.factor(K)) %>%
  group_by(K) %>%
  summarize_all(.funs = c(mean, sd))



k_result %>%
  select(K, exclusivity, semantic_coherence) %>%
  filter(K %in% c(40, 45, 50, 55, 60)) %>%
  unnest(cols = c(exclusivity, semantic_coherence)) %>%
  mutate(K = as.factor(K)) %>%
  ggplot(aes(semantic_coherence, exclusivity, color = K)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Comparing exclusivity and semantic coherence",
       subtitle = "Models with fewer topics have higher semantic coherence for more topics, but lower exclusivity") +
  #facet_wrap(K~.) +
  geom_hline(data=K_means, aes(yintercept=exclusivity_fn1, color=K)) + 
  geom_vline(data=K_means, aes(xintercept=semantic_coherence_fn1, color=K)) 
  

```

It looks like the model with 60 topics has the highest average exclusivity and the 3rd highest average coherence. However, this is a bit hard to see, so we can also look at the point estimates.


```{r}
K_means %>%
  rename(mean_exclusivity = exclusivity_fn1,
         mean_semantic_coherence = semantic_coherence_fn1,
         sd_exclusivity = exclusivity_fn2,
         sd_semantic_coherence = semantic_coherence_fn2) %>%
  select(K, mean_exclusivity, sd_exclusivity, mean_semantic_coherence, sd_semantic_coherence) %>%
  mutate_if(is.numeric, round, digits=2)
```


Interestingly, it looks like exclusivity is the same for K = 50, 55, and 60, though the SD is a little lower for K = 55 and 60. In contrast, the model with K = 60 actually has the 4th lowest semantic coherence. 

### Choosing a model

Ultimately, our model residuals and marginal fit statistics indicate that the model with 60 topics is the best. However, comparisons semantic coherence and exclusivity suggest that other models *could* be just as good, depending on how you define model success. 

When it's hard to identify a clear winner, you should almost always conduct sensitivity analyses across multiple models! If they all lead you to the same conclusion, then perhaps that conclusion warrants greater trust. However, if they all lead you to different conclusions, then you probably shouldn't trust any of them!

However, for our purposes, we'll choose the model with K=60 and proceed with our analyses. 


```{r}
stm_model.1.train <- k_result %>% 
  filter(K == 60) %>% 
  pull(topic_model) %>% 
  .[[1]]

stm_model.1.train
```

# Exploring STM models

Now that we've selected a topic model, we can begin to answer some of our questions. 


## What `topics` are relevant to our corpus?

To take a high-level glance at the topics estimated by our model, we can use the `stm` `plot` function with `type='summary`. This plot orders topics by the marginal proportion (e.g. likelihood of occurance) and shows the top words associated with each topic. 

```{r, fig.height=15}
plot(stm_model.1.train, type = "summary", xlim = c(0, .3), text.cex=1.5)

```

By default, the *top* words are defined as the words with the highest probability. However, we can also change this so that the *top* words are the words with the highest exclusivity score.


```{r, fig.height=15}

plot(stm_model.1.train, type = "summary", xlim = c(0, .3), text.cex=1.5, n=5, labeltype='frex')

```


<div class="alert alert-success" role="alert">
  <strong>Question:</strong> Based on these plots, which topic(s) are most relevant to impeachment?
</div>



## Inter-topic correlations 

In contrast to *vanilla* LDA models, STM models estimate a covariance matrix for the distribution of topics. We can use this covariance matrix to visualize associations among topics. 

```{r, fig.height=8, fig.width=8}
library(igraph)
cormat <- topicCorr(stm_model.1.train)
set.seed(123)
plot(cormat, niter=5000, repulserad=60^4*10, 
     edge.arrow.size=0.5, 
     vertex.label.cex=0.75, 
     vertex.label.family="Helvetica",
     vertex.label.font=2,
     vertex.shape="circle", 
     vertex.size=2, 
     vertex.label.color="black", 
     edge.width=0.5)
```

This is hard to read, so let's try a different kind of plot: 

```{r}

library(reshape2)

melted_cormat <- melt(cormat$cor)

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
geom_tile()

```

This is still hard to read! Because I'm primarily interested in associations with Topic 25, I'll just plot the correlations with that topic.

```{r}
library(plotly) 

p <- melted_cormat %>%
  filter(Var1 == 25 & Var2 != 25) %>%
  ggplot(aes(x = Var2, y = value)) + geom_text(aes(label=Var2)) + 
  ylab('Correlation') +
  xlab('Topic') +
  ggtitle('Correlations of Topic 25 with other topics')
  
ggplotly(p)
```

It looks like topic 25 is most strongly related to 35, 57, 53, so let's take a closer look at those topics. 

## Evaluating `topic` content

Out of 60 topics, the one's that seem most relevant to our questions about impeachment are 25, 35, 57, and 53. But, what do these topics *mean*? To get a better idea of their subjective meaning, we can look again at their top words. 


```{r, fig.width=8, fig.height=4}

plot(stm_model.1.train, type = "summary", xlim = c(0, .3), n=5, labeltype='prob', 
     topics = c(25, 35, 57, 53))

```


```{r, fig.width=8, fig.height=4}

plot(stm_model.1.train, type = "summary", xlim = c(0, .3), n=5, labeltype='frex', 
     topics = c(25, 35, 57, 53))

```



<div class="alert alert-success" role="alert">
  <strong>Question:</strong> Based on these plots, what do these topics mean?
</div>


### Examining relevant documents 

Lookin at the top words associated with a topic is a good way to get an idea of what the topic represents. However, there is another crucial source of information: the *documents* most strongly associated with the topic. When trying to summarize topics, you should *always* look at the top words *and* the top documents.

We can do this using the `findThoughts` function, which prints the text of the documents most strongly associated with a particular topic. However, to do this, we first need to make our texts accessible. For simplicity, I'll just examine the first 500 characters in each relevant document.



```{r}

texts = dat_tt_words %>% # This tidy data.frame contains the original words (i.e. before we lemmatized)
  filter(doc_num %in% unique(dat_tt_words.cln.samp.train$doc_num)) %>% # Keep only the docs in our training data
  group_by(doc_num) %>% 
  summarize(text = str_sub(paste0(word, collapse=' '), 1,500)) %>% # Collapse the rows of words into a single cell
  ungroup()

findThoughts(stm_model.1.train, texts = texts$text, topics = c(25), n=3)

```



```{r}

findThoughts(stm_model.1.train, texts = texts$text, topics = c(35), n=3)

```


```{r}
findThoughts(stm_model.1.train, texts = texts$text, topics = c(57), n=3)

```



```{r}
findThoughts(stm_model.1.train, texts = texts$text, topics = c(53), n=3)

```



<div class="alert alert-success" role="alert">
  <strong>Question:</strong> Based on these excerpts, as well as the topics' top words, how would you label the topics?
</div>

* Topic 25:
* Topic 35: 
* Topic 57: 
* Topic 53:


## Hypothesis testing with STMs

Clearly, Topic 25 is the most strongly related to impeachment. So, let's use the topic prevalence component of our model to estimate the distribution of Topic 25 over time and by party. 

To do this, we'll use `stm`'s `estimateEffect` function. Then, we'll use `tidystm`'s function `extract.estimateEffect` to extract a tidy dataframe of the estimated effects.

```{r}
library(tidystm)

prep = estimateEffect(c(25) ~ Party*s(date_int), stm_model.1.train, meta=train_X)

  
effs <- purrr::map(c('Democratic', 'Republican'), # Levels of moderator
                   ~extract.estimateEffect(prep, # effects estimate object
                                           "date_int", # The IV we want to look at
                                           model = stm_model.1.train, # Our STM model
                                           moderator='Party', # Moderator 
                                           moderator.value = .)) %>% # Moderator levels, which we specify via `map`
  do.call('rbind', .) # Here, we rbind the mapped lists, which yields a single DF

head(effs)

```

Importantly, the `estimateEffect` function uses *documents* as units and the topic proportion as the outcome. Thus, our effects estimates reflect expected changes in the topic proportion for a given document, conditional on our covariates. 

To get a better idea of what these effects imply, let's visualize them.


```{r}


effs %>%
  left_join(date_grid, by = c('covariate.value'='date_int')) %>%
  ggplot(aes(x = date, y = estimate)) + 
  geom_ribbon(aes(ymin=ci.lower, ymax=ci.upper, fill=moderator.value), alpha=.25) + 
  geom_line(aes(color=moderator.value)) +
  theme_apa() + 
  ylab('Topic Proportion') +
  xlab('Date') +
  geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2) + 
  geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2) +
  geom_label(aes(x = as_datetime('1998-10-08'), y=.32, label = "Impeachment Initiated")) +
  geom_label(aes(x = as_datetime('1998-12-19'), y=.32, label = "Impeachment Vote")) + 
  ggtitle('Estimated topic proportions by date and party') + 
  facet_wrap(topic~., ncol=1)

```



<div class="alert alert-success" role="alert">
  <strong>Question:</strong> What does this figure suggest?
</div>


Now, let's look specifically at the effects for the two days relevant to impeachment, the day the impeachment was initiated and the day it was voted on. 

```{r}

prep_marg = estimateEffect(c(25) ~ Party*s(date_int), stm_model.1.train, meta=train_X)


date_ints <- date_grid %>%
  filter(date == as_datetime('1998-10-08') | date == as_datetime('1998-12-19')) %>%
  pull(date_int)
  
party_effs <- purrr::map(date_ints, # Levels of moderator
                   ~extract.estimateEffect(prep, # effects estimate object
                                           "Party", # The IV we want to look at
                                           model = stm_model.1.train, # Our STM model
                                           moderator='date_int', # Moderator 
                                           moderator.value = .)) %>% # Moderator levels, which we specify via `map`
  do.call('rbind', .) # Here, we rbind the mapped lists, which yields a single DF



party_effs %>%
  left_join(date_grid, by = c('moderator.value'='date_int')) %>%
  ggplot(aes(y = estimate, x = as.factor(date), color=covariate.value)) + 
  geom_point(size=2, position=position_dodge(width=.25)) +
  geom_errorbar(aes(ymin=ci.lower, ymax=ci.upper), width=.1, position=position_dodge(width=.25)) +
  theme_apa() +
  ylab('Topic proportion') +
  xlab('Party') +
  ggtitle('Effect of Party on Topic 25 by date')


```




<div class="alert alert-success" role="alert">
  <strong>Question:</strong> What does this figure suggest?
</div>


## Investigating Topic Content

It seems clear that Democrats' floor speeches were more relevant to Topic 25 than Republicans. 

However, what if they speak about these topics in different ways? To investigate this hypothesis, we can estimate a new model that models topic content as a function of Party.   


```{r, eval=F}

stm_model.2.train <- stm(train_sparse,
            K = 60,
            prevalence = ~ Party*s(date_int), # prevalence model
            content = ~ Party,
            data = train_X,
            verbose=F)

saveRDS(stm_model.2.train, '../models/stm_model_2_train.RDS')

```

```{r}
stm_model.2.train <- readRDS('../models/stm_model_2_train.RDS')
```

Now, we'll visualize between-party differences in topic *content* using the `stm` plot function with `type='perspectives'`.


```{r, fig.height=8, fig.width=10}
plot(stm_model.2.train, type = "perspectives", topics = 25, n = 100)
```

In this figure, a word's size indicates its association with the topic. Further, it's position on the X-axis indicates it's differential association with the specified levels of the covariate. 


<div class="alert alert-success" role="alert">
  <strong>Question:</strong> What does this figure suggest?
</div>


# Validation with STM models 

At this point, we've fit a bunch of STM models and picked one for further exploration. Based on what we observed, it seems that  Democrats spoke more about impeachment on the day of the vote, but also that their discussions of impeachment focused more on "process", whereas Republicans' discussions of impeachment focused more explicitly on the president and words like "justice", "truth", and "lie". 

Given our understanding of the data, this makes sense! However, are these findings robust? To address this question, we will fit a new model on our held-out confirmation data. Ideally, we'd like to see the same topic structure and come to the same conclusions. However, if our conclusions deviate from our current expectations, we may need to accept the possibility that our conclusions are not reliable. 



```{r}

# Cast training data into sparse DTM 
test_sparse <- dat_tt_words.cln.samp.test %>%
  cast_sparse(doc_num, word_lemma, value=n)


# Create a date range from the min/max dates in our training data
date_grid <- tibble(date = seq(min(dat_tt_words.cln.samp.test$date), 
                               max(dat_tt_words.cln.samp.test$date), by='days')) %>%
  mutate(date_int = row_number()) # Associate each date with an integer


# Create design matrix 
test_X <- dat_tt_words.cln.samp.test %>%
  distinct(doc_num, Party, date) %>%
  left_join(date_grid) # Merge our dates with the date grid so that we can represent date as a sequenc of integers

```


```{r}

stm_model.2.test <- stm(test_sparse,
            K = 60,
            prevalence = ~ Party*s(date_int), # prevalence model
            content = ~ Party,
            data = test_X,
            verbose=F)

saveRDS(stm_model.2.test, '../models/stm_model_test.RDS')

```


First, let's look at the top words for each topic. 


```{r, fig.height=15}

plot(stm_model.2.test, type = "summary", xlim = c(0, .3), text.cex=1.5, n=5)

```

Uh oh, our nice "impeach" topic didn't show up! 


<div class="alert alert-success" role="alert">
  <strong>Question:</strong> Are there any topics related to our questions of interest?
</div>


## Extracting the `beta` matrix 

One way to look for relevant topics is to identify topics that place the highest probability on relevant keywords, such as "impeachment". To do this, we'll extract the `beta` matrix from our model and identify the topics that place the highest probability on "impeachment"

```{r}
td_beta <- tidy(stm_model.2.test, matrix='beta') %>% # We can extract a tidy dataframe containing the beta matrix from our model
  drop_na(topic) # For some reason, there are some NA rows for topics/terms. Should look into this!

td_beta %>%
  filter(term == 'impeachment') %>% # Pick a relevant word
  arrange(desc(beta)) %>% # Arrange by highest probability
  mutate(beta = round(beta, digits=2)) %>%
  head()
  
```

Interesting, it looks like topic 50 has, by far, the greatest probability density over 'impeachment'. Let's look at this topic, along with 15, 46, and 35


```{r, fig.height=5}

plot(stm_model.2.test, type = "summary", xlim = c(0, .3), n=5, topics=c(50,15,46,35))

```



<div class="alert alert-success" role="alert">
  <strong>Question:</strong> Are these topics related to impeachment? If so, in what way?
</div>


Okay, if you squint, Topic 50 could definitely be related to impeachment. Also, "Paula" in topic 46 is probably related to Paula Jones, the person who filed a sexual harassment lawsuit filed against Clinton. 

## Examining relevant documents 

Let's dig a bit deeper by looking at the relevant documents for each topic:

```{r}

texts = dat_tt_words %>% # This tidy data.frame contains the original words (i.e. before we lemmatized)
  filter(doc_num %in% unique(dat_tt_words.cln.samp.test$doc_num)) %>% # Keep only the docs in our training data
  group_by(doc_num) %>% 
  summarize(text = str_sub(paste0(word, collapse=' '), 1,500)) %>% # Collapse the rows of words into a single cell
  ungroup()

findThoughts(stm_model.2.test, texts = texts$text, topics = c(50), n=3)

```

This topic certainly seems relevant to impeachment; these (very few!) documents also suggest that maybe this topic is associated with *not* supporting impeachment.


```{r}

findThoughts(stm_model.2.test, texts = texts$text, topics = c(35), n=3)

```

This topic also seems to be about impeachment; however, it seems that the top documents express positions *for* impeachment. 


```{r}
findThoughts(stm_model.2.test, texts = texts$text, topics = c(46), n=3)

```

This also appears to be about impeachment, though with greater focus on lying and proceedings.



```{r}
findThoughts(stm_model.2.test, texts = texts$text, topics = c(15), n=3)

```

This appears to be less relevant to impeachment (as we have been thinking about it) and more relevant to proceedings. 



<div class="alert alert-success" role="alert">
  <strong>Question:</strong> What do these topics seem to be about?
</div>

* 35: 
* 46:
* 35: 
* 15: 

## Hypothesis testing on validation set

While these topics are different from what we observed in our training data, they still seem relevant. Let's go ahead and visualize the effects of our covariates: 

```{r}
library(tidystm)

prep.test = estimateEffect(c(50, 46, 35) ~ Party*s(date_int), stm_model.2.test, meta=test_X)

  
effs.test <- purrr::map(c('Democratic', 'Republican'), # Levels of moderator
                   ~extract.estimateEffect(prep.test, # effects estimate object
                                           "date_int", # The IV we want to look at
                                           model = stm_model.2.test, # Our STM model
                                           moderator='Party', # Moderator 
                                           moderator.value = .)) %>% # Moderator levels, which we specify via `map`
  do.call('rbind', .) # Here, we rbind the mapped lists, which yields a single DF

```


```{r, fig.height=6}

label_dat <- data.frame(date = as.numeric(as_datetime('1998-10-08')), label='test', estimate=.2)
                        
effs.test %>%
  left_join(date_grid, by = c('covariate.value'='date_int')) %>%
  mutate(topic = recode(topic, `50` = 'Oppose impeachment', `46` = 'Lying', `35` = 'Support impeachment' )) %>%
  ggplot(aes(x = date, y = estimate)) + 
  geom_ribbon(aes(ymin=ci.lower, ymax=ci.upper, fill=moderator.value), alpha=.25) + 
  geom_line(aes(color=moderator.value)) +
  theme_apa() + 
  ylab('Topic Proportion') +
  xlab('Date') +
  geom_vline(xintercept=as.numeric(as_datetime('1998-10-08')), linetype=2) + 
  geom_vline(xintercept=as.numeric(as_datetime('1998-12-19')), linetype=2) +
  geom_label(aes(x = as_datetime('1998-10-08'), y=.32, label = "Impeachment Initiated")) +
  geom_label(aes(x = as_datetime('1998-12-19'), y=.32, label = "Impeachment Vote")) + 
  ggtitle('Estimated topic proportions by date and party') + 
  facet_wrap(topic~., ncol=1)

```



<div class="alert alert-success" role="alert">
  <strong>Question:</strong> What does this figure suggest?
</div>


## Investigating Topic Content on validation set

Finally, we can take a look at party differences in topic content. First, let's look at Topic 50, the topic we're thinking of as associated with opposition to impeachment. 

```{r, fig.height=8, fig.width=10}
plot(stm_model.2.test, type = "perspectives", topics = 50, n = 100)
```

This looks somewhat similar to the perspectives plot of Topic 25 in our training data. However, there are some key differences, too.


Now let's look at Topic 35, the topic we're thinking of as associated with supporting impeachment:

```{r, fig.height=8, fig.width=10}
plot(stm_model.2.test, type = "perspectives", topics = 35, n = 100)
```

Interesting, it looks like, at least sometimes, Democrats might be talking about other issues (e.g. not Clinton's impeachment) in the context of this topic. 


Finally, let's look at Topic 46, the topic that seems to be a little more generally about "lying".

```{r, fig.height=8, fig.width=10}
plot(stm_model.2.test, type = "perspectives", topics = 46, n = 100)
```

Here, we can see that Republican's are more likely to mention Lewinsky and Jones, key people in the case *against* Clinton. Their use of this topic also places more density on words like "lie" and "perjury".


# Conclusions 


<div class="alert alert-success" role="alert">
  <strong>Question:</strong> Based on these analyses, what conclusions would you draw?
</div>

